Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array
Abstract
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only $O(n)$ bits for a text of length $n$; however, it also uses the text itself, which occupies $O(n \log |\Sigma|)$ bits for the alphabet $\Sigma$. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find $occ$ occurrences of any substring $P$ of the text in $O(|P| \log n + occ \log^{\epsilon} n)$ time and decompress a part of the text of length $l$ in $O(l + \log^{\epsilon} n)$ time for any given $1 \ge \epsilon > 0$. Our data structure occupies only $n\left(\frac{2}{\epsilon}\left(\frac{3}{2} + H_0 + 2\log H_0\right) + 2 + \frac{4\log^{\epsilon} n}{\log^{\epsilon} n - 1}\right) + o(n) + O(|\Sigma| \log |\Sigma|)$ bits, where $H_0 \le \log |\Sigma|$ is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.
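The three operations named in the abstract (inverse, search, decompress) can be illustrated with a minimal sketch. It is not the paper's data structure: all arrays are stored uncompressed, construction is naive, and the names `build`, `prefix_at`, `search`, `decompress`, `psi` and `first` are hypothetical. It only shows how a successor array (`psi`, mapping the rank of a suffix to the rank of the next suffix in text order) together with the first character of each sorted suffix lets search and decompress run without consulting the text after construction. The text is assumed not to contain the character `\0`, which is used as a terminator.

```python
def build(text):
    """Naively build the suffix array, its inverse, psi, and first characters."""
    t = text + "\0"                              # terminator, smaller than any text character
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])   # suffix array (naive construction)
    isa = [0] * n                                # inverse: isa[position] = rank
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    # psi[r] = rank of the suffix starting one position after the suffix of rank r
    psi = [isa[(sa[r] + 1) % n] for r in range(n)]
    first = [t[p] for p in sa]                   # first character of each sorted suffix
    return sa, isa, psi, first


def prefix_at(rank, m, psi, first):
    """Return up to m characters of the suffix with the given rank, using only psi/first."""
    out = []
    for _ in range(m):
        c = first[rank]
        if c == "\0":                            # reached the end of the text
            break
        out.append(c)
        rank = psi[rank]
    return "".join(out)


def search(pattern, sa, psi, first):
    """Binary search over ranks: O(log n) comparisons of O(|P|) characters each."""
    m, n = len(pattern), len(sa)
    lo, hi = 0, n
    while lo < hi:                               # first rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix_at(mid, m, psi, first) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, n
    while lo < hi:                               # first rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix_at(mid, m, psi, first) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Reporting positions reads the plain suffix array here; this is a simplification.
    return sorted(sa[left:lo])


def decompress(start, length, isa, psi, first):
    """Recover text[start:start+length]; the caller keeps start+length within the text."""
    rank = isa[start]                            # the "inverse" operation
    out = []
    for _ in range(length):
        out.append(first[rank])
        rank = psi[rank]
    return "".join(out)
```

For example, after `sa, isa, psi, first = build("abracadabra")`, calling `search("abra", sa, psi, first)` returns the starting positions of the pattern, and `decompress(3, 4, isa, psi, first)` recovers the four characters starting at position 3. In this sketch every array costs a full machine word per entry, so it demonstrates only the access pattern, not the space bound stated in the abstract.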
Related resources
A Modified Burrows-Wheeler Transformation for Case-Insensitive Search with Application to Suffix Array Compression
Now the Block sorting compression [1] becomes common by its good balance of compression ratio and speed. It has another nice feature, which is the relation between encoding/decoding process and suffix array. The suffix array [2] is a memory-efficient data structure for searching any substring of a text. It is an array of lexicographically sorted pointers to suffixes of a text. It is also used f...
A Space-Efficient Construction of the Burrows-Wheeler Transform for Genomic Data
Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, th...
CHICO: A Compressed Hybrid Index for Repetitive Collections
Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. To successfully exploit the regularities of repetitive collections different approaches have been proposed. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar...
Advantages of Backward Searching - Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays
One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H0 + O(log log σ)) bits of space, where n is the text size, σ is the alphabet size, and H0 the zero-order entropy of the text. The number of occurrences of a pattern of length m can be computed in O(m log n) time. Most notably, the CSA does no...
ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases
We study the problem of local alignment, which is finding pairs of similar subsequences with gaps. The problem exists in biosequence databases. BLAST is a typical software for finding local alignment based on heuristic, but could miss results. Using the Smith-Waterman algorithm, we can find all local alignments in O(mn) time, where m and n are lengths of a query and a text, respectively. A recen...
Publication date: 2000